Comparison of Time-Course Short and Long Read RNA-seq Data

Author

Anna Toidze, Marie-Claire Indilewitsch, Michel Tarnow

Published

January 11, 2024

Introduction

The read throughput of long-read RNA-sequencing methods, for example the Pacific Biosciences (PacBio) platform, are a subject to improvement especially in connection to RNA isoform identification and quantification (Hardwick et al, 2019). This is particularly of interest when looking at alternative splicing which is a regulatory process that influences coding sequences, as well as translation efficiency, stability and other genetic processes by the differential splicing of exons during transcript maturation. Thereby it is a factor in the development of several diseases (Baralle et al, 2017). Correctly identifying alternative transcript isoforms is challenging as short-read technologies like Illumina have too short read lengths (50-600 bp, necessary would be > 5 kB) to correctly identify the splice sites even though they have a high throughput. Long-read technologies like PacBio or Oxford Nanopore enable sequencing with the necessary read lengths for RNA isoform identification and quantification, however, these platforms cannot attain the necessary read depths since they suffer from low read depths/throughput (smaller than the necessary 2 * 10^7) (Al’Khafaji et al, 2023).

A newly introduced method for long-read sequencing of transcript isoforms based on ISO-seq is multiplexed arrays isoform sequencing (MAS-ISO-seq), which was introduced by Al’Khafaji et al (2023). MAS-ISO-seq relies on the concatenation of cDNA fragments into longer sequence libraries with a narrow length distribution, the workflow is depicted in figure 1, and thereby exploits the full capability of the Ile Sequel platform compared to others protocols. In short, the cDNA fragments are distributed over multiple PCR’s, each of which appends specific barcode adapters containing deoxy-uracil (dU). Following, the amplified samples are pooled again and undergo dU digestion. Finally, the MAS-ISO-seq workflow creates concatenated cDNA fragments by barcode-directed adapter ligation of the original cDNA fragments. These concatenated fragments can then be used for PacBio sequencing (Al’Khafaji et al, 2023).

Figure 1: MAS-ISO-seq workflow. MAS-ISO-seq relies on the concatenation of cDNA fragments into longer sequence libraries to fully use the capacities of the PacBio Sequel Ile platform. Figure from Al’Khafaji et al (2023).

The performance of the MAS-ISO-seq protocol was tested by Al’Khafaji et al (2023) using different sample sets. In summary, the authors could show that RNA isoforms could be identified much more confidently with the MAS-ISO-seq protocol than with short-read methods when applied to both a SIRV sample set and a single cell sample set. Moreover, error robustness and >15-fold increase in throughput of the PacBio Sequel Ile platform could be shown (Al’Khafaji et al, 2023).

Here, we performed a similar analysis comparing two different count data sets, one derived from the MAS-ISO-seq protocol (long-read sequencing) and one derived from Illumina sequencing (short-read sequencing). The data sets contain samples from WTC-11 cells (induced pluripotent stem cells) that underwent a directed differentiation from pluripotent stem cells to hemogenic endothelial cells. More details can be found here: directed-differentiation-hemogenic-endothelial-cells-from-human. For six days, 1 – 3 samples were prepared and sequenced using both the PacBio and Illumina platforms. The data was subsequently quantified with Bambu (Chen et al, 2023) in the case of the PacBio sequence data and with Salmon (Patro et al, 2017) in the case of Illumina sequence data. The quantification was performed using 4 different novel discovery rate (NDR) thresholds (NDR thresholds of 0.025, 0.05, 0.1 and 0.2 were applied here). The obtained data sets were finally compared in an exploratory data analysis (EDA), a differential gene expression (DGE) analysis and a differential transcript usage (DTU) analysis to explore the performance of the MAS-ISO-seq protocol compared to Illumina sequencing, which has been used and thus underwent optimization for much longer.

Methods

The data analysis steps were performed in RStudio (Posit team (2023), version 2023.6.2.561, R version 4.3.1).

Exploratory Comparison of Illumina+Salmon and PacBio+Bambu

First, the data was imported and preprocessed (see separate import_processing.qmd). Hereby, the control sequences were filtered out and counts and transcripts-per-million (TPM) saved separately for Illumina+Salmon (short read) data and PacBio+Bambu (long read) data. Further, the 16 different data frames (8 for TPM, 8 for counts) were sorted to have the same order and saved in .rds files (bambu, salmon and metadata) which were then used for further analysis. Then a first direct exploratory analysis was performed to get an overview of the data.

Exploratory Analysis

The exploratory analysis was divided into a first direct comparison looking at the differences in the counts/TPM between Illumina+Salmon and PacBio+Bambu for all 4 NDR thresholds (0.025, 0.05, 0.1, 0.2) and visualizing them with histograms, followed by the computation of correlations between the different samples of Illumina+Salmon and PacBio+Bambu also stratified by transcript length and lastly looking at the MDS plots. The whole analysis was performed for count data as well as TPM data but it was decided to keep the TPM in the report file as it is more representative when performing the stratification by length (further explained in discussion). But the count plots as well as all other TPM plots not shown in the following can be found in the file comparison_pacbio_illumina.qmd.

Differences Illumina+Salmon vs. PacBio+Bambu

First, the direct differences between the Illumina+Salmon and PacBio+Bambu data frames were calculated and visualized using histograms to see how different the raw data is. The histograms were only plotted for the first column of the data frame (sample 1) as plotting all columns would not lead to an increase in information.

Correlation Illumina+Salmon vs. PacBio+Bambu

Secondly, the correlation between Illumina+Salmon and PacBio+Bambu was looked at first not stratified by length and secondly, stratified by length on the transcript level (TPM) for all 4 NDR thresholds Hereby the samples were divided into 10 bins always portraying an interval of transcript length of 1000 (0-1000, 1000-2000 and so on). Thus, the bins do not have the same size. Both pearson and spearman correlation between all samples were performed and visualized in heat maps but in the results we decided on showing only the spearman plots (explained in the discussion). For all 10 bins the correlations were summed up and shown in a scatter plot to see the effect of transcript length for all 4 NDR thresholds.

MDS plots

Lastly, the MDS plots were plotted for all 8 data frames (here we used the count data frames) to compare the sample distribution between Illumina+Salmon and PacBio+Bambu.

Differential Gene Expression and Gene Set Analyses

Differential Gene Expression Analysis

The aim of the DGE analysis was twofold: first, the aim was to identify genes that are differentially expressed (DE) over the course of time, so over the course of the 5 days at which samples have been generated; second, the aim was to compare the results of the DGE analysis of the Illumina+Salmon data (short-read data) with the results of the PacBio+Bambu data (long-read data). To do so, in total 8 different count data frames were analyzed. For each of the 4 NDR thresholds (0.025, 0.05, 0.1, 0.2), there are two different count data frames, one derived from Illumina+Salmon and one derived from PacBio+Bambu. For the analysis of DE genes, a standard pipeline from the R package edgeR (Robinson et al, 2010, R package version 3.42.4) was used.

Before performing the DGE analysis itself, some genes of interest were investigated. These genes are targeted by the directed differentiation treatment the cells underwent, and therefore might show changes in their expression over time. Here, the following genes were investigated:

  • CTNNB1 (catenin beta 1)
  • ETV2 (ETS variant transcription factor 2)

Afterwards, the DGE analysis followed. The most important steps of the analysis included filtering of genes that do not have a sufficient count in a worthwhile number of samples, calculation of scaling factors (for library sizes) and posterior dispersion estimates, fitting of a negative binomial generalized log-linear model to each gene and performing a likelihood ratio test. To do so, a design matrix was created using the following formula: ~ day. Day specifies the day at which each sample was created, so days 0 to 5. Thus, the design matrix was created in a way that the first time point, day 0, is the baseline (intercept of the model). In the likelihood ratio tests, the full model was compared to a reduced model which has values of zero for all coefficients but the baseline, so for days 1 to 5. For each gene, p-values and other test statistics of the likelihood ratio test were computed and used afterwards to identify genes that show statistical significant differential expression. Adjusted the p-values were computed by the Benjamini-Hochberg method (Benjamini and Hochberg, 1995) and then DE genes were filtered by a significance threshold of 0.01.

To compare the results for Illumina+Salmon with the results for PacBio+Bambu, an UpSetR (Gehlenborg, 2019, R package version 1.4.0) plot was created for each NDR threshold. This plot overlayed the genes called DE for the two different cases. Moreover, the p-values and logFC values (comparing day 0 and day 5) for each gene were compared using simple scatter plots.

Grouping of Differentially Expressed Genes

After the DGE analysis, the genes identified as DE were grouped into gene sets. For this grouping, only the genes that were identified as DE by the analysis of both the Illumina+Salmon and the PacBio+Bambu data of the 0.2 NDR threshold were considered. Moreover, the genes were grouped depending on their Illumina+Salmon counts since these counts were treated as a ground truth to which the PacBio+Bambu counts were compared.

Of these genes, the average expression at each time point was calculated first, so the average of the samples of each time point. Then, all gene-gene Spearman-correlations were computed and 1-correlation was used as the distance between two genes during hierarchical clustering. The results of the hierarchical clustering were then evaluated to group the genes into clusters. From visual inspection of a cluster dendrogram and a correlation heat map, the number of clusters was set manually. Lastly, the average expression of the genes of the clusters at each time point was calculated for each cluster.

The clusters were then used as gene sets for the following gene set analysis.

Gene Set Analysis

To make sense of the genes that were called DE, they were first grouped into gene sets of similar genes (see above). These gene sets were then used as input for a gene set analysis. Here, the frequency-based method DAVID (Sherman et al, 2022, https://david.ncifcrf.gov/home.jsp) was employed.

The Database for Annotation, Visualization and Integrated Discovery (DAVID) tool is a web tool for functional annotation and enrichment analyses of gene sets, e.g. to find enriched gene ontology terms. To do so, the web tool requires the user to upload not only a list of genes, so the gene set of interest, but also another list of genes, the background. The tool then searches for functional annotations that are statistically overrepresented (enriched) in the gene set of interest compared to the background. Therefore, the choice of background can influence the results of the analysis. Here, the gene sets of interest were the clusters created during the grouping of the DE genes and as background all genes that remained after the filtering were used. The only exception to that are genes that were newly discovered during the transcript quantification since these genes do not have a valid ensemble gene ids which is a requirement of DAVID.

Differential Transcript Usage Analysis

After detecting differentially expressed genes, differentially expressed isoforms were identified using Differential Transcript Usage (DTU) analysis pipeline using DRIMSeq and DEXSeq libraries. DTU complements DGE as it enables the identification of alternative splicing and isoform switches even if the gene expression does not change between conditions (Marques-Coelho et al, 2021; Love et al, 2018). As for DGE, the aim was to identify differentially expressed isoforms in samples generated over course of the 5 days and to compare the results of the differential gene expression analysis of the Illumina+Salmon data (short read data) with the results of the PacBio+Bambu data (long read data).

The available workflows (Love et al, 2018; Lin, 2021) were tested on the given dataset (link file DTU_test.qmd). The workflow made use of 3 libraries vignettes: DRIMSeq, DEXSeq and stageR. DEXSeq assumes that the feature (transcript) counts follow a negative binomial (NB) distribution and considers them as relative proportions of the group (the gene) using an interaction term in a generalized linear model (GLM) (Love et al, 2018; Anders et al, 2012). DRIMSeq assumes that feature proportions follow the Dirichlet distribution (an Dirichlet Multinomial model (MD) for each gene) where the total count for the gene is considered fixed (Love et al, 2018; Nowicka et al, 2016). The filtering was done with DRIMSeq package with both, the minimal number of samples where the genes should be expressed and the the minimal number of samples where the features should be expressed, set to 3 (the number of replicates on day 0 and 5). Additional filters were also added: it was required for a transcript to (1) have count of at least 10 in at least 3 samples, (2) have a relative abundance proportion of at least 0.1 in at least 3 samples, and (3) have the total count of 10 in the corresponding gene in all 3 samples. The filtered DRIMSeq object was used for DEXSeq workflow, as the filtering was easier, and the analysis was sped up. The design formula was taken as ~sample + exon + day:exon (same as for DGE analysis). DEXSeq analysis included estimation of size factors, estimation of dispersion, and likelihood ratio tests for differential exon usage. The results tables (log2 fold changes and p-values) were generated and per-gene adjusted p-value were computed. stageR package was used for post-processing of the calculated p-values through stage-wise analysis of data, by screening first at the gene level for evidence of DTU (screening stage) and confirms which trancsripts within those significant genes show evidence of DTU (confirmation stage). stageR analysis was performed with overall false discovery rate (OFDR, alpha parameter) equal to 0.05. A custom function plotExpression was made to plot the transcript expression of a significant gene [Lin, 2021]. UpsetR plot and Venn diagrams were used to visualize the overlap between the transcripts identified by the two methods.

Results

Exploratory Comparison of Illumina+Salmon and PacBio+Bambu

Load data

The data is loaded from RDS-object files from bambu, salmon and the metadata (from import_processing.qmd). For each of the 4 NDR thresholds (0.025, 0.05, 0.1, 0.2), two data frames exist containing the expression data (counts) as well as transcript-per-million (TPM) of the two different technologies used to obtain the data:

1. Illumina sequencing and Salmon quantification (short read)
2. PacBio sequencing and Bambu quantification (long read)

In total, this adds up to 16 different data frames. Each of these data frames contains 11 samples, which are replicates from one of five different time points. Each time point has 1-3 replicates.

Differences

The differences between Illumina+Salmon and PacBio+Bambu have been computed in a direct comparison and it was calculated how many percent had a difference smaller than 50 (just to get an overview of the raw differences).

[1] "Differences transcript level (0.025, 0.05, 0.1, 0.2):"
[1] 0.9232244
[1] 0.9232975
[1] 0.9234513
[1] 0.9238523
[1] "Differences gene level (0.025, 0.05, 0.1, 0.2):"
[1] 0.8487736
[1] 0.8491386
[1] 0.8496814
[1] 0.851743

On the transcript level the differences are under 50 for always around 92% of the transcripts for all the NDR thresholds and for the gene level around 85% for all the NDR thresholds.

Histograms of differences

This was then visualized for always the first column. Here we only show the plots for the 0.025 and 0.05 thresholds on transcript and gene level because all 8 histograms look almost the same. The rest can be seen in the comparison_pacbio_illumia.qmd file.

As one can see, the most differences are around 0 (highest bar) with small bars around it. Thus, when just looking at the raw data the difference between Illumina+Salmon and PacBio+Bambu is small.

Correlation and stratification by length

Next, the correlation between Illumina+Salmon and PacBio+Bambu was computed. For that, first the transcript lengths were exported as well as their distribution looked at to be able to divide them into sections for the stratification by transcript length.

As seen in the above plot, most transcripts have a length between 0 and around 2000 and at around 10000 almost no transcripts can be found. So, we decided on dividing the lengths into intervals of 1000 from 0 to 10000. Here it is of note, that the intervals do not contain the same amount of transcripts so the bins have different frequencies.

So first, the correlation not stratified by length was plotted. We performed both pearson- and spearman-correlation but we will only show the spearman plot of the first NDR threshold for transcript and gene level (explained in discussion).

In the spearman heat maps, there is a lower correlation on the gene level compared to the transcript level. Also the day 2 and day 3 bambu samples seem to be highly correlated also with the day 4 and day 5 salmon samples.

Then, the correlation stratified by length was plotted for all 4 NDR thresholds. For the purpose of no further information gain we will only show one pearson plot and then representatively one spearman heat map (for one bin) for each NDR threshold (we chose 1000-2000 as its the most abundant transcript length) as well as all 4 scatter plots of the summed up correlation stratified by length. The individual heatmaps are not shown in the report but can be commented out in the .qmd file for visualization and saving space.

The pearson plots do not look like usual heat map as the day 2 - rep 1 - bambu is highly correlated with all samples of salmon and the rest of the correlations seem quite low. When observing the 0.025 heat map with spearman correlation one sees most of the samples being highly correlated with itself (or the replicates) and a bit clearer the highly correlated diagonal usually seen in heat maps.

The rest of the heat maps for the other thresholds (0.05, 0.1, 0.2) look very similar to the 0.025 heat map. Looking at the scatter plots of summed up correlations stratified by range of transcript lengths they also look almost the same between the different thresholds. One can observe the clear trend that with a higher transcript length the correlation is also higher but for very small transcript lengths (<1000) the correlation is very low.

MDS

The MDS plots look very similar between all the 4 NDR thresholds. When comparing the MDS plots between Salmon and Bambu the MDS1 separation by day is almost the same with the MDS2 separation being a bit more different. In Bambu the replicated are grouped a bit closer together then in Salmon and fewer samples are above 0.

Differential Gene Expression and Gene Set Analyses

Since the DGE analysis of the different NDR thresholds showed very similar results, only the 0.2 NDR threshold is shown here. The analysis of all 4 NDR thresholds can be found in the files dge_analysis.html and dge_analysis.qmd.

Load data

The data is loaded from RDS-objects that have been created with the import_processing.qmd file. For each of the 4 NDR thresholds (0.025, 0.05, 0.1, 0.2), two data frames containing the expression data (counts) of the two different technologies exist:

1. Illumina sequencing and Salmon quantification (short read)
2. PacBio sequencing and Bambu quantification (long read)

In total, this adds up to 8 different data frames. Each of these data frames contains 11 samples, which are replicates from one of six different time points. Each time point has 1-3 replicates. As stated above, in this file only the results of the 0.2 NDR threshold data is shown. For the DGE analysis, the counts are loaded. For the investigation of genes of interest, the TPMs are loaded.

   sampleID condition time
1         1 day0.rep1    0
2         2 day0.rep2    0
3         3 day0.rep3    0
4         4 day1.rep1    1
5         5 day2.rep1    2
6         6 day3.rep1    3
7         7 day3.rep2    3
8         8 day4.rep1    4
9         9 day5.rep1    5
10       10 day5.rep2    5
11       11 day5.rep3    5

Genes of Interest

To investigate the expression changes in the genes of interest (see above), their average expression at each time point was computed and then plotted.

# calculate average expression per time point for Salmon and Bambu

## Salmon
# calculate the average expression of each gene at each time point (average of the replicates)
avg_expr_salmon <- as.data.frame(sapply(sort(unique(meta_samples$time))[c(1,4,6)], function(time) {
  rowMeans(salmon_0.2_tpm_gene[, which(meta_samples$time == time)])
}))

# for time points 1, 2 and 4, only one replicate exists
avg_expr_salmon$V4 <- unlist(as.vector(salmon_0.2_tpm_gene[, which(meta_samples$time == 1)]))
avg_expr_salmon$V5 <- unlist(as.vector(salmon_0.2_tpm_gene[, which(meta_samples$time == 2)]))
avg_expr_salmon$V6 <- unlist(as.vector(salmon_0.2_tpm_gene[, which(meta_samples$time == 4)]))

avg_expr_salmon <- avg_expr_salmon[, c(1,4,5,2,6,3)]

colnames(avg_expr_salmon) <- sort(unique(meta_samples$time))

## Bambu
# calculate the average expression of each gene at each time point (average of the replicates)
avg_expr_bambu <- as.data.frame(sapply(sort(unique(meta_samples$time))[c(1,4,6)], function(time) {
  rowMeans(bambu_0.2_tpm_gene[, which(meta_samples$time == time)])
}))

# for time points 1, 2 and 4, only one replicate exists
avg_expr_bambu$V4 <- unlist(as.vector(bambu_0.2_tpm_gene[, which(meta_samples$time == 1)]))
avg_expr_bambu$V5 <- unlist(as.vector(bambu_0.2_tpm_gene[, which(meta_samples$time == 2)]))
avg_expr_bambu$V6 <- unlist(as.vector(bambu_0.2_tpm_gene[, which(meta_samples$time == 4)]))

avg_expr_bambu <- avg_expr_bambu[, c(1,4,5,2,6,3)]

colnames(avg_expr_bambu) <- sort(unique(meta_samples$time))

The first gene investigated was CTNNB1 (catenin beta 1). This gene was targeted by the treatment on days 0 and 1. Here, the addition of GSKi activated Wnt-signaling which in turn activated B-catenin providing transcriptional regulation. A trend is visible showing an increase in expression in the first days, and then the expression reaches a plateau for the last days, agreeing with the addition of GSKi on days 0 and 1. On days 3 - 5, the treatment continues with the addition of BMP4 and VEGF. BMP4 induces the transition of mesodermal cells into endothelial lineages through usage of upstream Indian-Hedgehog signaling and downstream Smad signaling. Moreover, the gene ETV2 (ETS variant transcription factor 2) is activated due to the addition of VEGF which is added starting from day 3 as described above. Here, almost no expression can be detected before day 3, followed by a major increase on day 3 and decreasing expression on days 4 and 5.

When comparing the expression levels from the Illumina+Salmon data and from the PacBio+Bambu data, the general trend of the expression levels seem to be similar. Furthermore, the expression levels are in agreement with the treatment protocol.

Create DGEList Objects and Filter Expressed Genes

The expression data was then filtered using the function filterByExpr to exclude genes that do not show high enough expression or do not show expression in enough samples.

day <- factor(meta_samples$time)
y_salmon_0.2 <- DGEList(counts = salmon_0.2_gene, group = day)
y_bambu_0.2 <- DGEList(counts = bambu_0.2_gene, group = day)

expressed_salmon <- filterByExpr(y_salmon_0.2)
expressed_bambu <- filterByExpr(y_bambu_0.2)

not_expressed <-
  intersect(names(expressed_salmon[!expressed_salmon]), names(expressed_bambu[!expressed_bambu]))

if (length(not_expressed) > 0) {
  y_salmon_0.2 <-
    y_salmon_0.2[-which(rownames(salmon_0.2_gene) %in% not_expressed), , keep.lib.sizes = FALSE]
  y_bambu_0.2 <-
    y_bambu_0.2[-which(rownames(bambu_0.2_gene) %in% not_expressed), , keep.lib.sizes = FALSE]
}

Differential Gene Expression Analysis

The DGE analysis aims to identify genes that are DE over the time course. These results were computed here only for the 0.2 NDR threshold of both technologies, and were then overlayed to compare the two technologies, so the long-read and short-read sequencing.

The design matrix was created in a way that the first time point, day0, was used as a baseline (intercept of the model).

# create design matrix
design <- model.matrix( ~ day)
design
   (Intercept) day1 day2 day3 day4 day5
1            1    0    0    0    0    0
2            1    0    0    0    0    0
3            1    0    0    0    0    0
4            1    1    0    0    0    0
5            1    0    1    0    0    0
6            1    0    0    1    0    0
7            1    0    0    1    0    0
8            1    0    0    0    1    0
9            1    0    0    0    0    1
10           1    0    0    0    0    1
11           1    0    0    0    0    1
attr(,"assign")
[1] 0 1 1 1 1 1
attr(,"contrasts")
attr(,"contrasts")$day
[1] "contr.treatment"

The differential expression analysis was performed with edgeR, and genes were considered as differentially expressed if the adjusted p-value of the likelihood ratio test between the full and the reduced model was below 0.01. The p-value adjustment was performed using the Benjamini-Hochberg method.

# Salmon 0.2

y_salmon_0.2 <- calcNormFactors(y_salmon_0.2)
## estimate dispersion
y_salmon_0.2 <- estimateDisp(y_salmon_0.2, design)
## fit the negative binomial model
fit <- glmFit(y_salmon_0.2, design)
## conduct genewise statistical tests for a given coefficient contrast (coef = row of design matrix)
lrtS <- glmLRT(fit, coef = 2:6)
## save lrt for pathway analysis
lrt_pathway <- lrtS
## multiple testing correction
decide_dif_s0.2 <-
  decideTests.DGELRT(lrtS,
                     adjust.method = "BH",
                     p.value = 0.01)
summary(decide_dif_s0.2)
       day5-day4-day3-day2-day1
NotSig                    18289
Sig                        7723
# Bambu 0.2

y_bambu_0.2 <- calcNormFactors(y_bambu_0.2)
## estimate dispersion
y_bambu_0.2 <- estimateDisp(y_bambu_0.2, design)
## fit the negative binomial model
fit <- glmFit(y_bambu_0.2, design)
## conduct genewise statistical tests for a given coefficient contrast (coef = row of design matrix)
lrtB <- glmLRT(fit, coef = 2:6)
## multiple testing correction
decide_dif_b0.2 <-
  decideTests.DGELRT(lrtB,
                     adjust.method = "BH",
                     p.value = 0.01)
summary(decide_dif_b0.2)
       day5-day4-day3-day2-day1
NotSig                    17667
Sig                        8345

For Illumina+Salmon, out of the 26012 tested genes, 7723 were identified as DE. In the case of PacBio+Bambu, 8345 genes were identified as DE. Both methods identified roughly the same amount of genes as DE.

From the UpSet plot above it is visible that, even though identifying roughly the same number of genes as DE, the two technologies show some differences in their results. The majority of genes identified overlap, however, there are still some genes only called DE by either Illumina+Salmon or by PacBio+Bambu. To further investigate the similarities/differences, the logFC and p-values of the analysis were compared. The p-values were calculated for the likelihood ratio test, the logFC values, however, were calculated for the discrete comparisons of the days 1-5 with day 0. Here, the logFC values of day 5 were compared.

The comparison of the p-values after adjustment shows a similar trend in both technologies, but also some differences. The same is observable for the logFC values. Additionally, it is visible that many genes that show a logFC value of zero in the analysis of one technology show a (high) logFC value in the other one. This agrees with the observation made in the UpSet plot, that some genes are called differentially expressed by either of the two technologies, but not by both.

The DGE analysis mainly showed two things. First, it showed that the results of the different NDR thresholds were very similar (not shown here). Second, it showed that the results of the two different technologies, Illumina+Salmon and PacBio+Bambu, were quite similar, but still showed some differences. When treating the Illumina+Salmon data as the ground truth, it can be said that PacBio+Bambu is close to the truth, but still shows some deviations.

Following a differential gene expression analysis, the identified genes were grouped into gene sets of similar genes and a gene set analysis was be conducted on these gene sets. To do so, only the genes that were detected by the analyses of both technologies were used, so the overlap which can be seen in the UpSet plot. Since the results of the different NDR thresholds were very similar, only one of the thresholds, namely the NDR threshold of 0.2, was used.

Grouping of Differentially Expressed Genes

To group the DE genes, the average expression of each gene at each time point was calculated first (for more details, see calculation of Results - Differential Gene Expression and Gene Set Analyses - Genes of Interest).

The genes were then clustered using the their correlation as a distance measure (to be more precise: 1-correlation). The cluster dendrogram of the hierarchical clustering as well as a heat map of the correlation including the dendrogram are shown below.

corr_DEG <- cor(t(avg_expr[DEG,]), method = "spearman")
hcl_DEG <- hclust(as.dist(1 - corr_DEG), method = "complete")
cl_DEG <- cutree(hcl_DEG, k = 4)
heatmap.2(corr_DEG, Rowv = as.dendrogram(hcl_DEG), Colv = as.dendrogram(hcl_DEG),
          trace = "none", scale = "none", labRow = NA, labCol = NA, col = viridis,
          ColSideColors = rainbow(15)[cl_DEG])

The genes were then split into 4 clusters by visual inspection, which can be seen in the plot above. The bar indicates the 4 clusters by different colors (yellow, red, green and orange). The average expression of all genes of each cluster at each time point are shown below:

The expression pattern of cluster 1 seems to follow a downwards trend, meaning that cluster 1 might contain genes that are mostly down-regulated over the time course. Clusters 2 and 3 show an upwards trend in expression, so these clusters might contain mostly genes that are up-regulated over the time course. Cluster 4 does not follow an apparent trend, there are many outliers visible at all time points and the vertical position of the boxes seem to fluctuate rather than following a trend. Moreover, this cluster is also the smallest one. For the following gene set analysis, the 4 clusters were used as gene sets.

Gene Set Analysis

The gene set analysis was performed using the method DAVID, which is a frequency-based method and computes which gene ontology terms are significantly enriched in a gene set compared to a background.

To perform DAVID on the 4 gene sets, the ensemble gene ids of the genes were extracted and then written into text files, which were then uploaded to the DAVID web tool. Since some of the newly discovered transcripts could not be mapped to an existing gene, there were some genes in the gene sets that did not have an ensemble gene id compatible with DAVID. These genes were therefore removed from the analysis. As explained in the methods part, the background list contained all genes of the 0.2 NDR threshold data after filtering, with the exception of newly identified genes.

The results were then downloaded from the DAVID web tool as text files and contain terms describing the functions or pathways different genes are involved in. The results also contain other measures, like the count of genes of the input gene set that are associated with a specific term, as well as adjusted p-values showing the significance of this term in the input gene set. The results were then filtered by the adjusted p-values. Following, the results were sorted by their adjusted p-values in ascending order. Below, the top 15 most enriched terms are shown for each cluster:

# import DAVID results
cluster1 <- read.csv("../DAVID/C1_expressed_background.txt", sep = "\t")
cluster2 <- read.csv("../DAVID/C2_expressed_background.txt", sep = "\t")
cluster3 <- read.csv("../DAVID/C3_expressed_background.txt", sep = "\t")
cluster4 <- read.csv("../DAVID/C4_expressed_background.txt", sep = "\t")
# get significant terms
cluster1_sig <- cluster1 %>%
  filter(Benjamini < 0.01) %>%
  arrange(Benjamini)

cluster1_terms <- cluster1_sig$Term
print(cluster1_terms[1:15])
 [1] "KW-0770~Synapse"                                                  
 [2] "KW-0628~Postsynaptic cell membrane"                               
 [3] "KW-1003~Cell membrane"                                            
 [4] "hsa04727:GABAergic synapse"                                       
 [5] "hsa04723:Retrograde endocannabinoid signaling"                    
 [6] "KW-0325~Glycoprotein"                                             
 [7] "IPR018000:Neurotransmitter-gated ion-channel, conserved site"     
 [8] "IPR006029:Neurotransmitter-gated ion-channel transmembrane domain"
 [9] "IPR006202:Neurotransmitter-gated ion-channel ligand-binding"      
[10] "IPR006201:Neurotransmitter-gated ion-channel"                     
[11] "GO:0030594~neurotransmitter receptor activity"                    
[12] "hsa05033:Nicotine addiction"                                      
[13] "hsa05032:Morphine addiction"                                      
[14] "GO:0045202~synapse"                                               
[15] "KW-1015~Disulfide bond"                                           
cluster2_sig <- cluster2 %>%
  filter(Benjamini < 0.01) %>%
  arrange(Benjamini)

cluster2_terms <- cluster2_sig$Term
print(cluster2_terms[1:15])
 [1] "KW-0325~Glycoprotein"                    
 [2] "KW-1015~Disulfide bond"                  
 [3] "CARBOHYD:N-linked (GlcNAc...) asparagine"
 [4] "KW-0732~Signal"                          
 [5] "KW-0964~Secreted"                        
 [6] "GO:0005615~extracellular space"          
 [7] "KW-0217~Developmental protein"           
 [8] "KW-9996~Developmental protein"           
 [9] "GO:0005886~plasma membrane"              
[10] "GO:0005576~extracellular region"         
[11] "KW-0371~Homeobox"                        
[12] "IPR020479:Homeodomain, metazoa"          
[13] "IPR001356:Homeodomain"                   
[14] "IPR017970:Homeobox, conserved site"      
[15] "SM00389:HOX"                             
cluster3_sig <- cluster3 %>%
  filter(Benjamini < 0.01) %>%
  arrange(Benjamini)

cluster3_terms <- cluster3_sig$Term
print(cluster3_terms[1:15])
 [1] "GO:0005886~plasma membrane"               
 [2] "KW-0325~Glycoprotein"                     
 [3] "KW-1015~Disulfide bond"                   
 [4] "CARBOHYD:N-linked (GlcNAc...) asparagine" 
 [5] "KW-0472~Membrane"                         
 [6] "KW-1003~Cell membrane"                    
 [7] "KW-0732~Signal"                           
 [8] "TOPO_DOM:Cytoplasmic"                     
 [9] "KW-0130~Cell adhesion"                    
[10] "KW-0964~Secreted"                         
[11] "TOPO_DOM:Extracellular"                   
[12] "GO:0009986~cell surface"                  
[13] "KW-0037~Angiogenesis"                     
[14] "GO:0016021~integral component of membrane"
[15] "GO:0070062~extracellular exosome"         
cluster4_sig <- cluster4 %>%
  filter(Benjamini < 0.01) %>%
  arrange(Benjamini)

cluster4_terms <- cluster4_sig$Term
print(cluster4_terms[1:15])
 [1] "GO:0005730~nucleolus"                                                         
 [2] "GO:0003723~RNA binding"                                                       
 [3] "GO:0005654~nucleoplasm"                                                       
 [4] "KW-0539~Nucleus"                                                              
 [5] "GO:0005694~chromosome"                                                        
 [6] "GO:0006364~rRNA processing"                                                   
 [7] "KW-0597~Phosphoprotein"                                                       
 [8] "KW-0007~Acetylation"                                                          
 [9] "KW-0698~rRNA processing"                                                      
[10] "KW-0690~Ribosome biogenesis"                                                  
[11] "KW-0235~DNA replication"                                                      
[12] "GO:0032040~small-subunit processome"                                          
[13] "CROSSLNK:Glycyl lysine isopeptide (Lys-Gly) (interchain with G-Cter in SUMO2)"
[14] "KW-0067~ATP-binding"                                                          
[15] "GO:0006260~DNA replication"                                                   

The significant results for gene set 1 include associations with neurological functions and pathways, for instance signalling at the neurological synapse. In total, 30 terms were identified as significantly enriched. For gene set 2, many significantly enriched terms were found (in total 250), which might be due to the large size of this gene set. The most significant terms include for instance Glycoprotein, Developmental Protein, Homeobox/domain and Extracellular Matrix. The results of gene set 3 include associations with the cell surface and the cell membrane as well as cellular signalling and cell adhesion. Here, 48 terms were found to be significantly enriched. Lastly, the results of gene sets 4 again include more terms then the ones of gene sets 1 and 3, but less then gene set 2 (in total 130). This seems to be surprising when just looking at the sizes of the genes sets, since gene set 4 is with 680 genes by far the smallest gene set. The significant terms of gene set 4 include functions related to the nucleus, DNA replication, transcription and translation.

When looking at all functions or pathways that DAVID found an association with, it seems like many different cellular components are involved in the changes over the course of time. Especially for the larger clusters, many different gene ontology terms were significant. From this analysis, it is difficult to make assumptions about specific pathways or cellular functions that are up- or down-regulated over the course of time. However, it can be said that the treatment of the cells with differentiation factors possible effects many components of the cells.

Differential Transcript Usage Analysis

Load data

The data is loaded from RDS-objects that have been created with the import_processing.qmd file. For each of the 4 NDR thresholds (0.025, 0.05, 0.1, 0.2), two data frames exist containing the expression data (counts) of the two different technologies used to obtain the data:

1. Illumina sequencing and Salmon quantification (short read)
2. PacBio sequencing and Bambu quantification (long read)

In total, this adds up to 8 different data frames. Each of these data frames contains 11 samples, which are replicates from one of five different time points. Each time point has 1-3 replicates.

Making a function for the pipeline using DEXSeq x DRIMSeq

Adapted from Lin (2021).

# `samps` dataframe relates the sample identifiers to the conditions (in this case, day of the experiment)
samps <- df_list_meta$metadata
samps <- samps %>% column_to_rownames(var="sampleID")
colnames(samps) <- c("sample_id", "day")
samps$day <- factor(samps$day)

DTU <- function(data, txdf, sample_name, n = 11, n.small = 11){
  # Counts
  cts <- data
  colnames(cts) <- samps$sample_id
  # `counts` with the gene ID, the feature (transcript) ID, and then columns for each of the samples is built for dmDSdata object
  counts <- data.frame(gene_id=txdf$gene_id,
                     feature_id=rownames(txdf),
                     cts)
  # Creating dmDSdata object
  d <- dmDSdata(counts=counts, samples=samps)
   # Filtering
  d <- dmFilter(d,
                min_samps_feature_expr=n.small, min_feature_expr=10,
                min_samps_feature_prop=n.small, min_feature_prop=0.1,
                min_samps_gene_expr=n, min_gene_expr=10)
  # Visualization of filtered dataframe
  print(plotData(d))
  # The count of isoforms of the filtered genes
  # print(table(table(counts(d)$gene_id)))
  # DEXSeq pipeline
  sample.data <- DRIMSeq::samples(d)
  count.data <- round(as.matrix(counts(d)[,-c(1:2)]))
  dxd <- DEXSeqDataSet(countData=count.data,
                       sampleData=sample.data,
                       design=~sample + exon + day:exon,
                       featureID=counts(d)$feature_id,
                       groupID=counts(d)$gene_id)
  dxd <- estimateSizeFactors(dxd)
  dxd <- estimateDispersions(dxd, quiet=TRUE)
  dxd <- testForDEU(dxd, reducedModel=~sample + exon)
  dxr <- DEXSeqResults(dxd, independentFiltering=FALSE)
  qval <- perGeneQValue(dxr)
  # Per gene
  dxr.g <- data.frame(gene=names(qval),qval)
  # Per transcript
  dxr.t = as.data.frame(dxr[, c("featureID","groupID","pvalue")])
  # Number of identified genes showing evidence for DTU
  print(paste0("Number of identified genes showing evidence for DTU through DEXSeq: ", nrow(dxr.g[dxr.g$qval < 0.05,])))
  # Number of transcripts involved
  print(paste0("Number of transcripts involved through DEXSeq: ", nrow(dxr[dxr$padj < 0.05,])))

  # stageR
  strp <- function(x) substr(x,1,15)
  # Vector of per-gene p-values for the screening stage
  pScreen <- qval
  names(pScreen) <- strp(names(pScreen))
  # One column matrix of the confirmation p-values for confirmation stage
  pConfirmation <- matrix(dxr$pvalue,ncol=1)
  dimnames(pConfirmation) <- list(strp(dxr$featureID),"transcript")
  # two column dataframe `tx2gene`with the transcript and gene identifiers
  tx2gene <-  data.frame(dxr.t[,c("featureID", "groupID")], dxr.t[,c("featureID", "groupID")])
  # OFDRm, alpha parameter equal to 0.05
  for (i in 1:2) tx2gene[,i] = strp(tx2gene[,i])
  stageRObj <- stageRTx(pScreen=pScreen, pConfirmation=pConfirmation,
                        pScreenAdjusted=TRUE, tx2gene=tx2gene[1:2])
  stageRObj = stageRTx(pScreen = pScreen, 
  pConfirmation = pConfirmation, 
  pScreenAdjusted = TRUE, 
  tx2gene = tx2gene[1:2])
  
  stageRObj <-  stageWiseAdjustment(stageRObj, method = "dtu", alpha = 0.05)
  
  suppressWarnings({
    dex.padj <- getAdjustedPValues(stageRObj, order=FALSE,
                                   onlySignificantGenes=TRUE)
  })
  
  dex.padj = merge(tx2gene, dex.padj, by.x = c("groupID","featureID"), by.y = c("geneID","txID"))
  # print(head(dex.padj[order(dex.padj$gene, decreasing = T), ]))
  
  # Normalize counts
  dex.norm = cbind(as.data.frame(stringr::str_split_fixed(rownames(counts(dxd)), ":", 2)), as.data.frame(counts(dxd, normalized = TRUE))[,1:20])
  colnames(dex.norm) = c("groupID", "featureID", as.character(colData(dxd)$sample_id)[1:20])
  row.names(dex.norm) = NULL

  # Per-group normalised mean
  dex.mean = as.data.frame(sapply( levels(samps$day), 
  function(lvl) rowMeans(dex.norm[, 3:ncol(dex.norm)][, samps$day == lvl, drop = FALSE]) ))
  colnames(dex.mean) <- paste0("mean.day", colnames(dex.mean))
  
  # log2 fold change in expression
  dex.log2fc = log2(dex.mean[2]/dex.mean[1])
  colnames(dex.log2fc) = "log2fc"
  rownames(dex.log2fc) = dex.norm$featureID
  
  # Merge to create result data
  dexData = cbind(dex.norm[,1:2], dex.mean, dex.norm[, 3:ncol(dex.norm)])
  colnames(dexData)[1:2] <-  c("GeneID","TranscriptID")
  dexData = dexData[order(dexData$GeneID, dexData$TranscriptID),]
  
  # Merge to create result data
  dexDTU = merge(dex.padj[,c("featureID.1","groupID.1","gene","transcript")], dex.log2fc, by.x = "featureID.1", by.y = "row.names")
  colnames(dexDTU)[1:2] <-  c("TranscriptID","GeneID")
  dexDTU = dexDTU[order(dexDTU$GeneID, dexDTU$TranscriptID),]

  # Plot the normalised counts for one of the significant genes, where we can see evidence of switching
  gene_id <-  unique(dex.padj[order(dex.padj$transcript, dex.padj$gene),]$groupID.1)[1]
  print(paste0("Significant gene chosen for visualization: ", gene_id))
  
  print(plotExpression(dex.norm, gene_id, samps, isProportion = T))
  
  return(dexDTU)
}

Analysis for NDR 0.2

[1] "Number of identified genes showing evidence for DTU through DEXSeq: 1578"
[1] "Number of transcripts involved through DEXSeq: 2569"
[1] "Significant gene chosen for visualization: ENSG00000010671.17"

[1] "Number of identified genes showing evidence for DTU through DEXSeq: 3530"
[1] "Number of transcripts involved through DEXSeq: 7799"
[1] "Significant gene chosen for visualization: ENSG00000004864.14"

[1] 2528
[1] 7108

After DRIMSeq filtering, 10364 (16.47%) of genes and 28447 (11.42) % of features remained in Illumina+Salmon data for further analysis. For PacBio+Bambu 7847 (12.47%) genes and (7.95) features passed the filter.

For Illumina+Salmon, out of the 10364 tested genes and 28447 tested features, DEXSeq test identified 1578 (15.23%) genes showing evidence of isoform switching involving 2569 (9.03%) transcripts. For PacBio+Bambu out of the 7847 tested genes and 19816 tested features, 3530 genes (44.99%) involving 7799 (39.36%) transcripts were identified.

After the stageR procedure, 2528 transcripts (8.89% of the filtered) passed the confirmation stage for Illumina+Salmon and 7108 transcripts (35.87% of the filtered) for PacBio+Bambu on a target 5% overall false discovery rate (OFDR). This means that, in expectation, no more than 5% of the genes that pass screening will either (1) not contain any DTU, so be falsely screened genes, or (2) contain a falsely confirmed transcript.

The genes were ordered according to their adjusted p-values for transcript and genes after stageR procedure. The first significant gene was chosen for visualization. For Illumina+Salmon data we can see that the isoform ENST00000308731.8 from gene ENSG00000010671.17 starts getting transcribed by the end of the treatment starting from day 3 and increases signiicantly. Meanwhile, the transcription of the isoform ENST00000621635.4 stays approximately the same. For PacBio+Bambu we can see that the counts for isoform ENST00000265631.10 from gene ENSG00000004864.14 drop to 0 on day 5, meanwhile, they increase for isoform ENST00000416240.6.

Upset Plot

From the UpSet plot and Venn Diagram above it is visible that through PacBio+Bambu many more transcripts were detected through DTU. The overlap between Illumina+Salmon was also only roughly 7%. This percentage is less than for the thresholds for NDR 0.025, 0.05 and 0.1 which showed the overlap of roughly 11%.

Comparsion of pAdj and logFC

The comparison of the adjusted p-values for transcripts shows low correlation between technologies especially for higher values. The logFC values seem to be, however, better correlated with some deviations for the extreme values.

Discussion

Exploratory Comparison of Illumina+Salmon and PacBio+Bambu

With the exploratory analysis of the data the goal was to get a first overview of both data sets as well as the differences in the raw data between Illumina+Salmon and PacBio+Bambu for all 4 different NDR cutoffs and correlations between the two methods stratified by length.

Firstly, the results for all 4 thresholds are very similar which is why we decided to also keep in only the results for one/two different NDR thresholds for this part of the report as we gain no extra information when including all thresholds and this would take up too much space. The only small differences can be explained by the fact that the NDR thresholds determine the number of newly detected genes which doesn’t have an effect on the comparison of both methods. Secondly, we performed the analysis on count and TPM data but decided to keep in only the result for TPM as we stratify by transcript length therefore making these results more interesting, also because the further analysis steps are also performed on TPM rather than the raw count data. For completeness the count results can be found in a separate file(comparison_pacbio_illumina.qmd).

Differences

We wanted to further look at the differences between the Illumina data, so short read data produced by a thoroughly optimized and long-in-use method, and the long-read PacBio data, that has not been optimized well and is still suffering some biases. So we started with just looking at the difference in raw data. Looking at the percentages and the histograms the difference in raw data seems to be quite small the majority being <50 as well as the highest bar of the histograms being around 0. This was shown here only for the first column as all the other columns look very similar (no information gain). This would suggest that the Illumina and PacBio raw data seem to be quite similar with only minor deviations.

Correlation and stratification by length

Next, we looked at the correlation between Illumina and PacBio data. We performed both pearson and spearman correlation. When looking at the pearson results compared to spearman they look very unusual. One can’t say that each sample is highest correlated with itself and some samples are highly correlated with all samples of all days (usually sample day 1 or 2) whilst the spearman heat maps look as expected. This can be explained by the fact that spearman correlation is more robust to outliers than pearson and doesn’t assume a linear relationship which works better with our data. So we decided to only focus on the plots produced by the spearman correlation. Also, it is of note to mention that we decided to cut our data into intervals of transcript length 1000 which results in the bin 1000-2000 containing the most transcripts and making them an irregular size. But this evens out as both bambu and salmon (which we compare) have those bin sizes and in the scatter plots we sum the correlations up. Further, we did our analysis on the transcript and not on the gene level as we stratify it on the transcript length and this will be harder to do on the gene level and not as precise as one would need to use median transcript lengths. So, looking at the heat maps not stratified by length they look the same for all the NDR thresholds but the correlations are smaller on the gene level which could maybe be explained by the fact the we sum up the transcripts for the genes thus resulting in fewer data and a smaller correlation.The heat maps for the length stratification again look similar between the 4 NDR thresholds and show as expected highest correlation with their own sample between salmon and bambu as well as the replicates. This means that the individual samples between bambu and salmon do correlate well and are the most similar the their own sample in the other method and thus both methods seem to produce comparable results although the heat maps and correlation aren’t as pronounced. This may be due to noise and outliers which even the spearman correlation isn’t robust to.

Lastly, looking at the scatter plot which again is almost the same for all 4 NDR thresholds we have a clear trend of a better summed up correlation towards longer lengths but a very low correlation for short transcript lengths (<=700 bp). This is due to the dropout in the PacBio data as it does not detect so short transcript lengths. The trend towards longer transcript lengths could be due to assignment of PacBio reads to transcripts getting more accurate with longer transcripts. This means that the PacBio and Illumina data are getting more correlated with longer transcripts and are less comparable with respect to shorter transcripts.

MDS

The MDS plots look again the same for all 4 NDR thresholds and are comparable between Illumina and PacBio. MDS1 separates the samples in both methods by days and MDS2 by replicates. For Salmon the replicates are more spread out than for Bambu but the overall trend is the same with the differences being explained by the difference in method again showing that the raw (in this case count data) is comparable between the both methods.

Differential Gene Expression and Gene Set Analyses

Differential Gene Expression Analysis

The aim of the DGE analysis was to identify genes that are DE over the course of time, and to compare these results of the Illumina+Salmon data (short-read data) with the results of the PacBio+Bambu data (long-read data) for 4 different NDR thresholds.

When comparing the different NDR thresholds, only minor differences could be detected. This was expected since the main difference between the data sets of the different thresholds were the number of newly detected genes. The more interesting part was the comparison between the short-read data and the long-read data. Here, the long-read data was produced using a new protocol, the MAS-ISO-seq protocol, which displays an optimization for PacBio sequencing. To evaluate this, the results of the PacBio data was compared to the Illumina data results. It could be shown that for both cases roughly the same amount of genes were identified as DE. However, when overlaying these genes, there were still many genes only called DE in one of the two cases. Moreover, when comparing the adjusted p-values and logFC values of the likelihood ratio tests, deviations between the two cases were visible. Nevertheless, a common trend could be detected and the majority of DE genes tend to be called by both methods, this shows that the analysis of the PacBio data already comes close to the truth, when treating Illumina as the truth. Due to the lack of PacBio data from older protocols, it is not possible to draw any conclusions regarding the improvement of this protocol compared to others.

Grouping of Differentially Expressed Genes

The grouping of DE genes was only an intermediate step between the DGE analysis and the gene set analysis. Nonetheless, it showed some interesting results and could possible influence the outcome of the gene set analysis. The number of clusters/sets the genes were grouped in (here 4) was choosen more or less arbitrarily, mostly guided by visual inspection. Since the number of clusters was rather small, and the number of genes that were grouped was rather large, most of the clusters were rather large as well. Since the clusters were used as input gene sets for the gene set analysis, this choice possible effected the outcome of the gene set analysis and will thus be discussed again in the context of the gene set analysis.

Gene Set Analysis

The gene set analysis was performed with the aim of making sense of the genes that the DGE analysis identified as DE. To do so, the detected DE genes were grouped into 4 gene sets, which were then used as input for the gene set analysis performed with the web tool DAVID.

For the gene set 1 and 3, around 30 to 50 gene ontology terms were found to be significantly enriched in the gene sets of about 1200 - 1500 genes. The terms found significant for these two gene sets were rather homogeneous. The same could be observed for gene set 4. Notably, the amount of terms that were detected to be significantly enriched in gene set 4 was much higher (130 terms) than in gene sets 1 and 3, even though gene set 4 was only around half the size of the other two (there are 680 genes in gene set 4). Assuming that more genes in a gene set usually lead to more terms to be found significant, this could mean that gene set 4 is less homogeneous than gene sets 1 and 3. In agreement with that, the average expression of genes in gene set 4 over time also shows more heterogeneity than observed in the other clusters. The largest gene set, gene set 2, shows the highest amount of significant terms (250 while containing 2158 genes). These results seem to be more heterogeneous than the results of the other 3 gene sets, which could be caused by the larger size of gene set 2 and/or by more heterogeneity between the genes of this gene set. To overcome this issue and get clearer results also for gene set 2, the number of clusters during the grouping of differentially expressed genes could have been choosen larger to create smaller and therefore possible more homogeneous gene sets. However, when cutting the tree with the cutree function, the largest cluster, so gene set 2, gets only notable smaller when setting the number of cluster to 13. At this point, many small clusters with less than 50 genes are created. Therefore, the number of clusters was left to be 4.

In general, the terms that were found to be significant by the gene set analysis for all gene sets cover a lot of different functions and pathways in the cell. This could be explained by the experimental conditions, and by the cell line that was used. The cell line, namely WTC-11 cells, are induced pluripotent stem cells and therefore posses embryonic stem cell-like properties and are of pluripotent nature. These cells were treated with differentiation factors, meaning that these factors drove the process of cellular differentiation, so the differentiation of a less specialized cell into a more specialized one. Thus, many different cellular function and pathways could be affected.

Differential Transcript Usage Analysis

The aim of the differential transcript usage analysis was to identify alternative splicing and isoform switches over the course of time, and to compare these results of the Illumina+Salmon data (short-read data) with the results of the PacBio+Bambu data (long-read data) for 4 different NDR thresholds.

When comparing the different NDR thresholds, only minor differences could be detected. Overall, the number of screened genes and confirmed transcripts increased with the threshold. This was expected as the number of detected genes also increased with increasing NDR threshold. The shorter reads from Illumina+Salmon data fail to span successive splice sites, which impairs the identification of isoform switching (Al’Khafaji et al, 2023) so that only roughly 10% of the filtered features passed the stageR confirmation stage from Illumina+Salmon data. Meanwhile, longer reads from PacBio+Bambu span the majority of the human transcripts roughly 35% of the filtered features passed this stage for PacBio+Bambu (Al’Khafaji et al, 2023). Overall, the detected number of features were much higher for PacBio+Bambu and overlap between transcripts showing isoform switching was only roughly 10%. Both methods complement each other in DTU.

Conclusion

In conclusion, the exploratory data analysis gave us an overview over the differences in the raw data between Illumina and PacBio and showed that the data has its differences but seems to be quite comparable. This is the case especially when looking at the raw data differences as well as the correlation with there still being difficulties in shorter transcript lengths. Further, the Differential Gene Expression (DGE) analysis demonstrated a comparable identification of differentially expressed genes between Illumina+Salmon and PacBio+Bambu data, with some deviations and a recognition of the inherent limitations. The subsequent Gene Set Analysis revealed diverse functional enrichments in gene sets, influenced by the experimental conditions. Lastly, the Differential Transcript Usage (DTU) analysis highlighted that the number of features were much higher for PacBio+Bambu than for Illumina+Salmon as the longer read length spans the majority of human transcripts. All in all, a thorough comparison of the two methods could be performed whilst highlighting different aspects of both methods.

Literature

Al’Khafaji AM, Smith JT, Garimella KV, Babadi M, Popic V, Sade-Feldman M, Gatzen M, Sarkizova S, Schwartz MA, Blaum EM et al (2023) High-throughput RNA isoform sequencing using programmed cDNA concatenation. Nat Biotechnol

Anders S, Reyes A, Huber W (2012) Detecting differential usage of exons from RNA-seq data. Genome Res 22: 2008-2017

Baid G, Cook DE, Shafin K, Yun T, Llinares-López F, Berthet Q, Belyaeva A, Töpfer A, Wenger AM, Rowell WJ et al (2023) DeepConsensus improves the accuracy of sequences with a gap-aware sequence transformer. Nat Biotechnol 41: 232-238

Baralle FE, Giudice J (2017) Alternative splicing as a regulator of development and tissue identity. Nat Rev Mol Cell Biol 18: 437-451

Benjamini Y, Hochberg Y (1995) Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. J R Stat Soc B 57: 289-300

Chen Y, Sim A, Wan YK, Yeo K, Lee JJX, Ling MH, Love MI, Göke J (2023) Context-aware transcript quantification from long-read RNA-seq data with Bambu. Nat Methods 20: 1187-1195

Conway JR, Lex A, Gehlenborg N (2017) UpSetR: an R package for the visualization of intersecting sets and their properties. Bioinformatics 33: 2938-2940

Hardwick SA, Joglekar A, Flicek P, Frankish A, Tilgner HU (2019) Getting the Entire Message: Progress in Isoform Sequencing. Front Genet 10

Lin H, 2021. Guide to RNA-seq Data Analysis. URL https://ycl6.gitbook.io/guide-to-rna-seq-analysis/differential-expression-analysis/differential-transcript-usage/dtu-using-dexseq.

Love M, Soneson C, Patro R (2018) Swimming downstream: statistical analysis of differential transcript usage following Salmon quantification [version 3; peer review: 3 approved]. F1000Res 7

Marques-Coelho D, Iohan LdCC, Melo de Farias AR, Flaig A, Letournel F, Martin-Négrier M-L, Chapon F, Faisant M, Godfraind C, Maurage C-A et al (2021) Differential transcript usage unravels gene expression alterations in Alzheimer’s disease human brains. npj Aging Mech Dis 7: 2

Nowicka M, Robinson MD (2016) DRIMSeq: a Dirichlet-multinomial framework for multivariate count outcomes in genomics. F1000Res 5: 1356

Patro R, Duggal G, Love MI, Irizarry RA, Kingsford C (2017) Salmon provides fast and bias-aware quantification of transcript expression. Nat Methods 14: 417-419

Posit team (2023). RStudio: Integrated Development Environment for R. Posit Software, PBC, Boston, MA. URL http://www.posit.co/.

Robinson MD, McCarthy DJ, Smyth GK (2009) edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26: 139-140

Sherman BT, Hao M, Qiu J, Jiao X, Baseler MW, Lane HC, Imamichi T, Chang W (2022) DAVID: a web server for functional enrichment analysis and functional annotation of gene lists (2021 update). Nucleic Acids Res 50: W216-w221

Wenger AM, Peluso P, Rowell WJ, Chang PC, Hall RJ, Concepcion GT, Ebler J, Fungtammasan A, Kolesnikov A, Olson ND et al (2019) Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat Biotechnol 37: 1155-1162

Session Info and R Package Versions

sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.2.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Zurich
tzcode source: internal

attached base packages:
[1] stats4    grid      stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] ggbeeswarm_0.7.2            stageR_1.22.0              
 [3] DEXSeq_1.46.1               RColorBrewer_1.1-3         
 [5] AnnotationDbi_1.62.2        DESeq2_1.40.2              
 [7] SummarizedExperiment_1.30.2 GenomicRanges_1.52.1       
 [9] GenomeInfoDb_1.36.4         IRanges_2.34.1             
[11] S4Vectors_0.38.2            MatrixGenerics_1.12.3      
[13] matrixStats_1.1.0           Biobase_2.60.0             
[15] BiocGenerics_0.46.0         BiocParallel_1.34.2        
[17] DRIMSeq_1.28.0              ggvenn_0.1.10              
[19] readxl_1.4.3                gplots_3.1.3               
[21] cowplot_1.1.1               patchwork_1.1.3            
[23] pheatmap_1.0.12             viridis_0.6.4              
[25] viridisLite_0.4.2           UpSetR_1.4.0               
[27] edgeR_3.42.4                limma_3.56.2               
[29] lubridate_1.9.3             forcats_1.0.0              
[31] stringr_1.5.1               dplyr_1.1.4                
[33] purrr_1.0.2                 readr_2.1.4                
[35] tidyr_1.3.0                 tibble_3.2.1               
[37] ggplot2_3.4.4               tidyverse_2.0.0            

loaded via a namespace (and not attached):
 [1] rstudioapi_0.15.0       jsonlite_1.8.7          magrittr_2.0.3         
 [4] farver_2.1.1            rmarkdown_2.25          zlibbioc_1.46.0        
 [7] vctrs_0.6.4             memoise_2.0.1           Rsamtools_2.16.0       
[10] RCurl_1.98-1.13         htmltools_0.5.7         S4Arrays_1.0.6         
[13] progress_1.2.2          curl_5.1.0              cellranger_1.1.0       
[16] KernSmooth_2.23-22      htmlwidgets_1.6.3       plyr_1.8.9             
[19] cachem_1.0.8            lifecycle_1.0.4         pkgconfig_2.0.3        
[22] Matrix_1.6-4            R6_2.5.1                fastmap_1.1.1          
[25] GenomeInfoDbData_1.2.10 digest_0.6.33           colorspace_2.1-0       
[28] geneplotter_1.78.0      RSQLite_2.3.3           hwriter_1.3.2.1        
[31] filelock_1.0.2          labeling_0.4.3          fansi_1.0.5            
[34] timechange_0.2.0        httr_1.4.7              abind_1.4-5            
[37] mgcv_1.9-0              compiler_4.3.1          bit64_4.0.5            
[40] withr_2.5.2             DBI_1.1.3               biomaRt_2.56.1         
[43] rappdirs_0.3.3          DelayedArray_0.26.7     gtools_3.9.5           
[46] caTools_1.18.2          tools_4.3.1             vipor_0.4.5            
[49] beeswarm_0.4.0          glue_1.6.2              nlme_3.1-164           
[52] reshape2_1.4.4          generics_0.1.3          gtable_0.3.4           
[55] tzdb_0.4.0              hms_1.1.3               xml2_1.3.5             
[58] utf8_1.2.4              XVector_0.40.0          pillar_1.9.0           
[61] genefilter_1.82.1       splines_4.3.1           BiocFileCache_2.8.0    
[64] lattice_0.22-5          survival_3.5-7          bit_4.0.5              
[67] annotate_1.78.0         tidyselect_1.2.0        locfit_1.5-9.8         
[70] Biostrings_2.68.1       knitr_1.45              gridExtra_2.3          
[73] xfun_0.41               statmod_1.5.0           stringi_1.8.2          
[76] yaml_2.3.7              evaluate_0.23           codetools_0.2-19       
[79] cli_3.6.1               xtable_1.8-4            munsell_0.5.0          
[82] Rcpp_1.0.11             dbplyr_2.3.4            png_0.1-8              
[85] XML_3.99-0.16           parallel_4.3.1          blob_1.2.4             
[88] prettyunits_1.2.0       bitops_1.0-7            scales_1.3.0           
[91] crayon_1.5.2            rlang_1.1.2             KEGGREST_1.40.1